This post walks through common metrics for evaluating model performance in R, including recall, precision, F1 score, and kappa statistics, using the tidyverse and pipes, with predictions and ground truth stored in two data frames with matching column names. It also highlights the importance of correctly specifying factor levels to ensure valid metric calculations.
Disclaimer: This post is written by an AI language model based on R code provided by the author. The purpose is to document and explain R techniques for personal reference.
Evaluating model performance is crucial for understanding how well your machine learning models are working. In this post, we’ll explore different metrics, including recall, precision, F1 score, and kappa statistics, which help assess the accuracy and reliability of your models. We’ll simplify the implementation using the tidyverse package and pipes, assuming you have a dataset named prediction for predicted values and correct for actual values, with matching variable names. We also emphasize the importance of correctly specifying factor levels when working with binary classification data: incorrect level ordering can lead to invalid or misleading metric calculations, which we will demonstrate and address.
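To make the definitions concrete, here is a minimal sketch on made-up toy vectors (pred and truth below are invented for illustration, not part of the original analysis), showing how caret derives these metrics from a confusion table:

library(caret)
# Toy data; "1" is the positive class because it is listed first in levels
pred  <- factor(c(1, 1, 0, 1, 0, 0, 1, 0, 1, 1), levels = c(1, 0))
truth <- factor(c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1), levels = c(1, 0))
table(pred, truth)      # confusion matrix: rows = predicted, columns = actual
recall(pred, truth)     # TP / (TP + FN) = 4/5 = 0.8
precision(pred, truth)  # TP / (TP + FP) = 4/6 ≈ 0.67
F_meas(pred, truth)     # harmonic mean of precision and recall ≈ 0.73

All three functions score the first factor level as the positive (“relevant”) class by default, which is why level order matters throughout this post.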
We’ll use the pacman package to load irr, caret, tidyverse, and gt for calculating metrics, building tables, and managing data efficiently. pacman::p_load() installs any missing packages before loading them.

pacman::p_load("irr", "caret", "tidyverse", "gt")

We’ll create example datasets prediction and correct to demonstrate the evaluation process. These datasets have matching variable names and contain binary classification data. Important note: the factor levels must be correctly specified, with 1 representing the positive class and 0 the negative class. If the levels are reversed (e.g., levels = 0:1), the metrics will be computed incorrectly.
set.seed(123)
# Create example datasets
prediction <- tibble(
  Formål_1 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_2 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_3 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_4 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_5 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_2 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_3 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_4 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_5 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_6 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0))
)
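As a quick check, the positive class 1 is indeed the first level (this output is deterministic, since it depends only on the factor definition):

levels(prediction$Formål_1)
[1] "1" "0"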
correct <- tibble(
  Formål_1 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_2 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_3 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_4 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_5 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_2 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_3 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_4 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_5 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_6 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0))
)

When factor levels are specified incorrectly (e.g., levels = 0:1), the positive and negative classes are swapped: caret’s metric functions treat the first factor level as the positive class, so the metrics are silently computed for the negative class. For instance:
incorrect_prediction <- factor(sample(c(0, 1), 100, replace = TRUE), levels = c(0, 1))
incorrect_reference <- factor(sample(c(0, 1), 100, replace = TRUE), levels = c(0, 1))
# Incorrect calculation
recall(data = incorrect_prediction, reference = incorrect_reference)
[1] 0.5961538
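Here recall has been computed for class 0, because 0 is the first level. As a sketch of the fix, releveling the same vectors so that 1 comes first scores the intended positive class instead (fixed_prediction and fixed_reference are illustrative names; output is omitted because it depends on the random draw above):

# Relevel so that "1" is treated as the positive class
fixed_prediction <- factor(incorrect_prediction, levels = c(1, 0))
fixed_reference <- factor(incorrect_reference, levels = c(1, 0))
recall(data = fixed_prediction, reference = fixed_reference)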
Next, we define a helper function that computes recall, F1, and precision for a single variable:

compute_metrics <- function(variable_name, prediction_data, correct_data) {
  tibble(
    Variabel = variable_name,
    rec = recall(data = prediction_data[[variable_name]], reference = correct_data[[variable_name]]),
    F1 = F_meas(data = prediction_data[[variable_name]], reference = correct_data[[variable_name]]),
    prec = precision(data = prediction_data[[variable_name]], reference = correct_data[[variable_name]])
  )
}
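Applied to a single column, the helper returns a one-row tibble (output omitted, since the values depend on the simulated data):

compute_metrics("Formål_1", prediction, correct)

To evaluate every variable at once, we collect the column names and map the helper over them: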
variables <- colnames(prediction)
results <- map_df(variables, ~ compute_metrics(.x, prediction, correct))

We define a similar helper for Cohen’s kappa and percentage agreement, using kappa2() and agree() from the irr package:

compute_kappa_agreement <- function(variable_name, prediction_data, correct_data) {
  tibble(
    Variabel = variable_name,
    Kappa = kappa2(cbind(prediction_data[[variable_name]], correct_data[[variable_name]]))$value,
    Agreement = agree(cbind(prediction_data[[variable_name]], correct_data[[variable_name]]))$value
  )
}
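As a sanity check on what these two functions measure, here is a toy example (hypothetical ratings, unrelated to the data above) in which two raters agree perfectly:

ratings <- cbind(c(1, 0, 1, 1, 0), c(1, 0, 1, 1, 0))
kappa2(ratings)$value  # 1: perfect chance-corrected agreement
agree(ratings)$value   # 100: agree() reports percentage agreement (0–100)

Cohen’s kappa corrects raw agreement for agreement expected by chance, so on random binary data it hovers near zero even when raw agreement is close to 50%. We now map this helper over all variables: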
kappa_agreement_results <- map_df(variables, ~ compute_kappa_agreement(.x, prediction, correct))

Finally, we join the two sets of results into a single table:

final_results <- results %>%
  left_join(kappa_agreement_results, by = "Variabel")
gt(final_results) %>% fmt_number(2:5)

| Variabel | rec | F1 | prec | Kappa | Agreement | 
|---|---|---|---|---|---|
| Formål_1 | 0.40 | 0.43 | 0.47 | −0.06 | 47 | 
| Formål_2 | 0.67 | 0.64 | 0.61 | 0.26 | 63 | 
| Formål_3 | 0.47 | 0.47 | 0.47 | −0.04 | 48 | 
| Formål_4 | 0.45 | 0.46 | 0.47 | 0.04 | 53 | 
| Formål_5 | 0.53 | 0.52 | 0.50 | 0.06 | 53 | 
| Avsender_2 | 0.56 | 0.54 | 0.53 | −0.03 | 49 | 
| Avsender_3 | 0.53 | 0.49 | 0.45 | 0.01 | 50 | 
| Avsender_4 | 0.49 | 0.50 | 0.51 | 0.00 | 50 | 
| Avsender_5 | 0.41 | 0.43 | 0.45 | −0.06 | 47 | 
| Avsender_6 | 0.49 | 0.46 | 0.44 | −0.06 | 47 | 

Note that agree() reports simple percentage agreement (0–100), while kappa2() returns Cohen’s kappa, which corrects for chance; with random data like this, agreement hovers around 50% and kappa around zero.
For attribution, please cite this work as
Solheim, Ø. B. & ChatGPT (Ghost Writer) (2025, April 1). Solheim: Evaluating Model Performance: Recall, Precision, F1 Score, and Kappa using Tidyverse. Retrieved from https://www.oyvindsolheim.com/library/Evaluating model performance/
BibTeX citation
@misc{solheim2025evaluating,
  author = {Solheim, Øyvind Bugge and {ChatGPT (Ghost Writer)}},
  title = {Solheim: Evaluating Model Performance: Recall, Precision, F1 Score, and Kappa using Tidyverse},
  url = {https://www.oyvindsolheim.com/library/Evaluating model performance/},
  year = {2025}
}